# Introduction to Open Data Science - Course Project
Write a short description about the course and add a link to your GitHub repository here. This is an R Markdown (.Rmd) file so you should use R Markdown syntax.
# This is a so-called "R chunk" where you can write R code.
date()
## [1] "Mon Nov 14 17:16:49 2022"
The text continues here.
Open the file chapter1.Rmd located in your IODS-project folder with RStudio. Just write some of your thoughts about this course freely in the file, e.g.,
1. How are you feeling right now?
I am excited about the course. It is a bit intimidating, since it seems that there is quite a lot of work. But I can't wait to be able to apply what I will learn here to my own research.
2. What do you expect to learn?
I am a PhD student and I found the course content very fitting for my needs. I am most excited to learn more about GitHub and R Markdown. Also, the classes regarding model validation (2), clustering/classification (4), and dimensionality reduction techniques (5) are very interesting topics for me, since I am doing my PhD on the psychometric validation of the (Short) Warwick-Edinburgh Mental Well-being Scale, (S)WEMWBS, in the Finnish population.
3. Where did you hear about the course?
I remember taking one of Kimmo’s courses almost 10 years ago, when I was an undergraduate. Even though the course at the time was held very early in the morning, I really enjoyed his class. I live in Australia, and when I noticed (in Sisu) that he is holding an online course again, I signed up immediately.
Also reflect on your learning experiences with the R for Health Data Science book and the Exercise Set 1:
4. How did it work as a “crash course” on modern R tools and using RStudio?
I have used RStudio before, so I am familiar with the program and I had everything already installed. However, this is the first time I will be using R Markdown and GitHub.
I also have another statistics course at the moment where we will be using R Markdown, so I am excited to learn the syntax and get familiar with the program, along with GitHub, to see how I can use them in my own research. GitHub, for example, could work really well when I have multiple scripts that I am testing and editing. I also really like the layout of R Markdown: once knitted, it is so much easier to follow than a normal R script. The interactive features are amazing and could be really cool to add as supplementary material to a manuscript, so people could view different scenarios and examine the topic a bit more deeply.
Also, the R Markdown Tutorial was very helpful.
5. Which were your favorite topics?
I really like the layout of R Markdown and the Cheat Sheet.
Also, I found the R for Health Data Science book very helpful and I know that I will be using it a lot in the future. It seems to have very illustrative examples and code that I can adapt to my own research. I think it will be much more useful later on, when we have actual exercises where we need to write our own code. I tend to use Stack Overflow, general Googling, and other people's code as a dictionary or grammar book when I need to solve some issues with my code. In my opinion, you learn best when you are simultaneously trying to apply the piece of code to solve a problem. Just reading or viewing it is also helpful, but it is hard to grasp all the information at once without a specific task you are trying to solve.
6. Which topics were most difficult?
I think I have an okay understanding of R and RStudio. I know how to “read” and “edit” most of the code, install and use new packages, etc. The difficult part is when you have an idea of what you want to do, and you try to find the best way to edit the code (for example, getting certain colours, dividing data based on strata, etc.). Sometimes the packages have different syntax than the “normal” R code; even the syntax in R Markdown is different (e.g., how to mark comments).
However, I found the example code in Exercise1.Rmd very helpful to get started. I would prefer if R for Health Data Science also had a PDF version, since I prefer to have a copy saved on my personal laptop, so I could highlight and add comments to the text. Also, if I understood it correctly, the book is built around the tidyverse packages: the pipe %>% comes from the magrittr package and is made available when the tidyverse is loaded, so it will not work unless the tidyverse (or magrittr) is installed and loaded with library(). There are many ways to write R code: some packages use base R syntax, some use their own, and sometimes the two are mixed. Having a tutorial that helps to understand which syntax you need or can use would be very beneficial.
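As a small illustration of the pipe syntax discussed above, here is a sketch that uses the native pipe `|>`, which base R has provided since version 4.1, so no package is needed. The vector `x` is made up for the example. With the tidyverse (or magrittr) loaded, `%>%` could be used in the same places.

```r
# Made-up numbers, only for illustration
x <- c(4, 9, 16)

# Nested calls are read from the inside out
nested <- round(mean(sqrt(x)), 1)

# The same steps with the native pipe (base R >= 4.1), read left to right
piped <- x |> sqrt() |> mean() |> round(1)

# Both versions give the same result
identical(nested, piped)
```

The pipe only changes how the code reads: each step receives the result of the previous step as its first argument.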
However, I have not used GitHub before, so I found it quite difficult to get started and to understand the layout, and which things are saved on my personal computer and which are online. “Committing” and “pushing” things to GitHub also seemed quite hard at the start.
I also find it challenging to learn and understand the YAML header at the start of an R Markdown script, and how to edit it.
For example, my index.Rmd code did not run at the start, and trying to find ways to fix it was difficult. In the end it just worked even though I did not change anything - I think it was trying to knit the script into something other than HTML.
Also add in the file a link to your GitHub repository (that you created earlier): https://github.com/your_github_username/IODS-project
Remember to save your chapter1.Rmd file.
Open the index.Rmd file with RStudio.
At the beginning of the file, in the YAML options below the ‘title’ option, add the following option: author: “Your Name”. Save the file and “knit” the document (there’s a button for that) as an HTML page. This will also update the index.html file.
index.Rmd error code
Error in yaml::yaml.load(…, eval.expr = TRUE) :
Parser error: while parsing a block mapping at line 1, column 1
did not find expected key at line 3, column 3
Calls:
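For comparison, a minimal YAML header that parses cleanly might look like the sketch below. The title and output values are assumptions for illustration, not the actual contents of my index.Rmd. Every top-level option starts in the first column; an option that is indented by accident is one possible cause of a "did not find expected key" error like the one above.

```yaml
---
title: "IODS course project"
author: "Your Name"
output: html_document
---
```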
To make the connection between RStudio and GitHub as smooth as possible, you should create a Personal Access Token (PAT).
The shortest way to proceed is to follow the steps below. (Source: https://happygitwithr.com/https-pat.html)
Execute the R commands (preceded by ‘>’) in the RStudio Console (below the Editor):
> install.packages(“usethis”) > usethis::create_github_token()
GitHub website will open in your browser. Log in with your GitHub credentials.
Return to RStudio and continue in the Console:
> gitcreds::gitcreds_set()
Apparently, I already had a PAT, but I decided to update it so I could finish this assignment. Now you should be able to work with GitHub, i.e., push and pull from RStudio.
Upload the changes to GitHub (the version control platform) from RStudio. There are a few phases (don’t worry: all this will become an easy routine for you very soon!):
Note: It is useful to make commits often, even for small changes.
Commits are at the heart of the version control system, as a single commit represents a single version of the file.
After a few moments, go to your GitHub repository at https://github.com/your_github_username/IODS-project to see what has changed (please be patient and refresh the page).
Also visit your course diary that has automatically been updated at https://your_github_username.github.io/IODS-project and make sure you see the changes there as well.
After completing the tasks above you are ready to submit your
Assignment for the review (using the Moodle Workshop below).
Have the two links (your GitHub repository and your course
diary) ready!
Remember to get back there when the Review phase begins (see course
schedule).
End of Assignment 1: Tasks and Instructions

***
Describe the work you have done this week and summarize your learning.
date()
## [1] "Mon Nov 14 17:16:50 2022"
TASK INSTRUCTIONS: Create a folder named ‘data’ in your IODS-project folder. Then create a new R script with RStudio. Write your name, date and a one sentence file description as a comment on the top of the script file. Save the script for example as create_learning2014.R in the data folder. Complete the rest of the steps in that script.
Figure demonstrates how to create a new folder.
Please see create_learning2014.R and lrn14_KS.csv in my GitHub repository to evaluate the data wrangling: https://github.com/kiirasar/IODS-project (you can find the files in the data folder).
First we install/use R packages we need to complete the assignment.
# Select (with mouse or arrow keys) the install.packages("...") and
# run it (by Ctrl+Enter / Cmd+Enter):
# install.packages("GGally")
#install.packages("GGally")
#install.packages("tidyverse")
#install.packages('readr')
#install.packages('ggplot2')
#install.packages("psych")
#install.packages("vtable")
library(vtable)
## Warning: package 'vtable' was built under R version 4.2.2
## Loading required package: kableExtra
## Warning: package 'kableExtra' was built under R version 4.2.2
library(psych)
## Warning: package 'psych' was built under R version 4.2.2
library(GGally)
## Warning: package 'GGally' was built under R version 4.2.2
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.2.2
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.2.2
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ tibble 3.1.8 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.1
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ✔ purrr 0.3.4
## Warning: package 'readr' was built under R version 4.2.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ ggplot2::%+%() masks psych::%+%()
## ✖ ggplot2::alpha() masks psych::alpha()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::group_rows() masks kableExtra::group_rows()
## ✖ dplyr::lag() masks stats::lag()
library(readr)
library(ggplot2)
TASK INSTRUCTIONS: Read the students2014 data into R either from your local folder (if you completed the Data wrangling part) or from this url.
Explore the structure and the
dimensions of the data and describe the dataset
briefly, assuming the reader has no previous knowledge of it.
Information related to data can be found
here
# Read the data from your local drive using setwd()-command
# setwd('C:\\Users\\Kiira\\Documents\\PhD_SWEMWBS\\PhD Courses\\Courses in 2022\\PHD-302 Open Data Science\\IODS-project')
# lrn14 <- read_csv("data/lrn14_KS.csv")
# head(lrn14) #gender, age, A_att, A_deep, A_stra, A_surf, points
# View(lrn14)
# or from url
std14 <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/learning2014.txt", sep = ",", header = TRUE) # the separator is a comma and the first row holds the column names
head(std14) #gender, age, attitude, deep, stra, surf, points
## gender age attitude deep stra surf points
## 1 F 53 3.7 3.583333 3.375 2.583333 25
## 2 M 55 3.1 2.916667 2.750 3.166667 12
## 3 F 49 2.5 3.500000 3.625 2.250000 24
## 4 M 53 3.5 3.500000 3.125 2.250000 10
## 5 M 49 3.7 3.666667 3.625 2.833333 22
## 6 F 38 3.8 4.750000 3.625 2.416667 21
View(std14)
The head() command shows the first 6 rows of the dataset, whereas View() opens the whole dataset in a new tab.
NOTE. The data from the local drive is named lrn14 and the data from the url is named std14.
The read_csv() command worked in R before, but for some reason I could not knit the document with it. This is why it is only there as comments and I decided to use the url (std14) dataset. The data is exactly the same; only the variable names are different. I will use the url data to complete the assignment.
dim(std14)
## [1] 166 7
dim() is an R function for exploring the dimensions of a dataset. The dataset has 166 rows (observations) and 7 columns (variables). You can read the names of the variables or have a better look at the data by using head(std14) and View(std14).
str(std14)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : chr "F" "M" "F" "M" ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
str() is an R function for exploring the structure of a dataset. The data frame has 166 observations and 7 variables, as dim() also showed.
TASK INSTRUCTIONS: Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them.
SUMMARY STATISTICS:
To explore the summaries of each variable, I used the vtable package and its st() command, also known as sumtable().
Here is a link where you can find more information regarding vtable and the st() command.
st(std14) # the command prints a summary statistics table to Viewer-window
| Variable | N | Mean | Std. Dev. | Min | Pctl. 25 | Pctl. 75 | Max |
|---|---|---|---|---|---|---|---|
| gender | 166 | | | | | | |
| … F | 110 | 66.3% | | | | | |
| … M | 56 | 33.7% | | | | | |
| age | 166 | 25.512 | 7.766 | 17 | 21 | 27 | 55 |
| attitude | 166 | 3.143 | 0.73 | 1.4 | 2.6 | 3.7 | 5 |
| deep | 166 | 3.68 | 0.554 | 1.583 | 3.333 | 4.083 | 4.917 |
| stra | 166 | 3.121 | 0.772 | 1.25 | 2.625 | 3.625 | 5 |
| surf | 166 | 2.787 | 0.529 | 1.583 | 2.417 | 3.167 | 4.333 |
| points | 166 | 22.717 | 5.895 | 7 | 19 | 27.75 | 33 |
Dataset std14 has a total of 166 observations (participants) and 7 variables (gender, age, attitude, deep, stra, surf and points). In the dataset:
NOTE. The scores for the different learning methods (deep, stra, surf) are averages of several items for each learning method. The summary displays the basic descriptive statistics: mean, standard deviation, minimum, lower and upper quartiles (25% and 75%) and maximum. The scale for the learning techniques is 1-5.
Points denotes the student's exam points in a statistics course exam.
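To illustrate how such an averaged scale score can be built, here is a minimal sketch with rowMeans(). The item names (q1, q2, q3) and the responses are hypothetical and are not the real survey items behind deep, stra and surf.

```r
# Hypothetical item responses on a 1-5 scale (not the real survey items)
items <- data.frame(
  q1 = c(4, 2, 5),
  q2 = c(3, 3, 4),
  q3 = c(5, 1, 3)
)

# The scale score is the mean of the items for each participant (row)
items$deep <- rowMeans(items[, c("q1", "q2", "q3")])

items$deep
```

Averaging keeps the score on the original 1-5 scale, which makes scales with different numbers of items comparable.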
BARPLOTS - Nominal variables:
I used the ggplot2 package and a barplot to explore the distribution and counts by gender (nominal).
# ggplot() = command, std14 = data frame, aes(x = variable) + type of plot
gg_gender <- ggplot(std14, aes(x=gender)) + geom_bar() #barplot for nominal variables.
gg_gender
# you can make the plot look nicer by adding extra code:
ggplot(std14, aes(x = gender, fill = gender)) + # fill the bars based on gender
geom_bar() +
geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.3) + # add counts on top of the bars
labs(x = "gender", y = "count", fill = "gender") + # axis and legend labels
ggtitle("Barplot of gender in the Learning 2014 dataset") + # add a title
scale_x_discrete(labels = c("F" = "Female", "M" = "Male")) # relabel F as Female and M as Male
According to the previous summary table and barplot dataset std14 has 110 female and 56 male participants.
HISTOGRAMS - Continuous variables:
I made histograms for every continuous variable (age, attitude, deep, stra, surf, and points) to check whether they are normally distributed, meaning that the distribution follows the bell curve. Clearly non-normal variables are a warning sign for parametric approaches such as regression models, and non-parametric methods may then be more appropriate. Strictly speaking, though, linear regression assumes normally distributed residuals rather than normally distributed variables.
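Besides looking at histograms, normality can also be checked with a formal test and a QQ-plot. The sketch below uses simulated data (not the course data) and base R functions only; the seed, sample sizes and distribution parameters are arbitrary choices for illustration.

```r
set.seed(2014)  # arbitrary seed so the simulated example is reproducible

# Simulated data, not the course data: one symmetric and one right-skewed sample
symmetric <- rnorm(200, mean = 3, sd = 0.5)
skewed    <- rexp(200, rate = 0.2)

# Shapiro-Wilk test: a small p-value suggests a departure from normality
sw_symmetric <- shapiro.test(symmetric)
sw_skewed    <- shapiro.test(skewed)

c(symmetric = sw_symmetric$p.value, skewed = sw_skewed$p.value)

# A normal QQ-plot gives a visual check: points far from the line indicate non-normality
qqnorm(skewed)
qqline(skewed)
```

With real data, the same calls could be applied to a single column, for example shapiro.test(std14$age).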
NOTE. When making plots, it is important to include everyone. Some people, e.g., those who are colour blind, might have difficulty seeing all the colours, so it is important to use the right colours. On this website you can find inclusive colour palettes.
The hex codes (e.g., #E69F00) refer to specific colours.
Also, I wanted to print all the plots on one page by using the multiplot() command. multiplot() is not part of the ggplot2 package itself; it is a helper function, so before using it I needed to run the function definition below.
multiplot <- function(..., plotlist = NULL, file, cols = 1, layout = NULL) {
  require(grid)
  # collect the plots given as arguments and in plotlist
  plots <- c(list(...), plotlist)
  numPlots <- length(plots)
  # if no layout is given, fill a grid with 'cols' columns, column by column
  if (is.null(layout)) {
    layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
                     ncol = cols, nrow = ceiling(numPlots/cols))
  }
  if (numPlots == 1) {
    print(plots[[1]])
  } else {
    grid.newpage()
    pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))
    # print each plot in the grid cell that holds its index
    for (i in 1:numPlots) {
      matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))
      print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                      layout.pos.col = matchidx$col))
    }
  }
}
Then I created a histogram for each continuous variable and used the colours from the inclusive colour palettes mentioned above.
# NOTE. you can also use "=" instead of "<-" to create objects. However, "<-" is often better, since "=" is also used for passing function arguments.
p1=ggplot(std14) +
geom_histogram(aes(x = age), fill = "#E69F00") +
labs(title="Age")
p2=ggplot(std14) +
geom_histogram(aes(x = attitude), fill = "#56B4E9") +
labs(title="Attitude")
p3=ggplot(std14) +
geom_histogram(aes(x = deep), fill = "#009E73")+
labs(title="Deep learning")
p4=ggplot(std14) +
geom_histogram(aes(x = stra), fill = "#F0E442")+
labs(title="Strategic learning")
p5=ggplot(std14) +
geom_histogram(aes(x = surf), fill = "#0072B2")+
labs(title="Surface learning")
p6=ggplot(std14) +
geom_histogram(aes(x = points), fill = "#D55E00")+
labs(title="Points")
Last, I ran the multiplot()-command.
multiplot(p1, p2, p3, p4, p5, p6, cols=3) #prints 3 columns
## Loading required package: grid
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Summary regarding the results:
Next, I wanted to create a histogram where I added all the learning strategies on top of each other.
# NOTE. alpha = .5 makes the colours 50% transparent.
ggplot(std14) +
geom_histogram(aes(x = deep), fill = "#009E73", alpha=.5) + # green
geom_histogram(aes(x = stra), fill = "#F0E442", alpha=.5) + # yellow
geom_histogram(aes(x = surf), fill = "#0072B2", alpha=.5) + # blue
labs(title="Learning strategies", x="Learning strategies (Mean)")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Summary regarding the results:
RELATIONSHIP between the variables:
The ggpairs() command is part of the GGally package (an extension of ggplot2), and it creates a more advanced plot matrix where you can explore the relationships between the variables.
ggpairs(std14, mapping = aes(), lower = list(combo = wrap("facethist", bins = 20)))
The command prints out:
- Histograms (first 2 columns on left) based on gender (female, male)
- Boxplots (first row) based on gender: female (top), male (bottom)
- Density plots (diagonal) only for continuous variables
- Correlations (above the diagonal) only for continuous variables
- Scatterplots (below the diagonal): relationships between continuous variables
However, the figure is quite small, so it is easier to explore the scatterplots by using pairs() command.
# this piece of code excludes the gender (nominal variable)
pairs(std14[-1])
But even that is quite ugly. Also, the plots below and above the diagonal are mirror images of each other (the same pairs with the axes swapped). To get nicer scatterplots, we can create them one by one with ggplot.
Age scatterplots
# Age
sp1 <- ggplot(std14, aes(x = age, y = attitude)) +
geom_point() + #scatterplot
geom_smooth(method = "lm") + #regression line
labs(title="Scatterplot: age and attitude")
sp2 <- ggplot(std14, aes(x = age, y = deep)) +
geom_point() + #scatterplot
geom_smooth(method = "lm") + #regression line
labs(title="Scatterplot: age and deep learning")
sp3 <- ggplot(std14, aes(x = age, y = stra)) +
geom_point() + #scatterplot
geom_smooth(method = "lm") + #regression line
labs(title="Scatterplot: age and strategic learning")
sp4 <- ggplot(std14, aes(x = age, y = surf)) +
geom_point() + #scatterplot
geom_smooth(method = "lm") + #regression line
labs(title="Scatterplot: age and surface learning")
sp5 <- ggplot(std14, aes(x = age, y = points)) +
geom_point() + #scatterplot
geom_smooth(method = "lm") + #regression line
labs(title="Scatterplot: age and points")
multiplot(sp1, sp2, sp3, sp4, sp5, cols=3)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
# age and attitude, r=0.022
# age and deep learning, r=0.025
# age and strategic learning, r=0.102
# age and surface learning, r=-0.141
# age and points, r=-0.093
The correlations between age and the other variables are very low, so the scatterplots show an almost flat regression line, indicating little or no correlation. Age does not seem to be related to the learning techniques, attitude or overall exam points. However, age was very skewed (most participants were close to the same age), which may affect the results.
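The r values quoted in this section can be computed with cor() and tested with cor.test(). The sketch below uses only the first six rows printed by head(std14) earlier, so that it runs without downloading the data; with so few rows the numbers will differ from the full-data correlations and should not be interpreted.

```r
# First six rows of std14 as printed by head(); too few rows for real conclusions
first6 <- data.frame(
  age      = c(53, 55, 49, 53, 49, 38),
  attitude = c(3.7, 3.1, 2.5, 3.5, 3.7, 3.8),
  points   = c(25, 12, 24, 10, 22, 21)
)

# Pearson correlation matrix for all pairs of variables
r_matrix <- cor(first6)
round(r_matrix, 3)

# Correlation with a significance test for one pair
age_points <- cor.test(first6$age, first6$points)
age_points$estimate  # Pearson r
age_points$p.value   # p-value for H0: the true correlation is zero
```

On the full dataset, the same idea would be cor(std14[-1]), which drops the nominal gender column first.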
Attitude scatterplots
## Attitude
sp6 <- ggplot(std14, aes(x = attitude, y = deep)) +
geom_point() + #scatterplot
geom_smooth(method = "lm") + #regression line
labs(title="Scatterplot: attitude and deep learning")
sp7 <- ggplot(std14, aes(x = attitude, y = stra)) +
geom_point() + #scatterplot
geom_smooth(method = "lm") + #regression line
labs(title="Scatterplot: attitude and strategic learning")
sp8 <- ggplot(std14, aes(x = attitude, y = surf)) +
geom_point() + #scatterplot
geom_smooth(method = "lm") + #regression line
labs(title="Scatterplot: attitude and surface learning")
sp9 <- ggplot(std14, aes(x = attitude, y = points)) +
geom_point() + #scatterplot
geom_smooth(method = "lm") + #regression line
labs(title="Scatterplot: attitude and points")
multiplot(sp6, sp7, sp8, sp9, cols=2)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
#attitude and deep learning, r=0.110
#attitude and strategic learning, r=0.062
#attitude and surface learning, r= -0.176*
#attitude and points, r=0.437***
The first two graphs (left column) showed very small correlations and the lines were very flat.
The relationship between attitude and surface learning (top right) shows a small but significant negative correlation (r= -0.176), so the line goes down: people with higher surface learning scores tend to have lower attitude scores.
This could mean that people who use surface learning techniques have a worse attitude towards learning in general.
In contrast, the relationship between attitude and points (bottom right) shows a significant positive correlation (r=0.437); the line goes up. This indicates that individuals with high attitude scores often also have high exam points, and vice versa: individuals with a low attitude score tend to have low exam points.
One interpretation of this finding is that people who have a good attitude towards learning also succeed better in their exams.
Deep learning scatterplots
# deep
sp10 <- ggplot(std14, aes(x = deep, y = stra)) +
geom_point() + #scatterplot
geom_smooth(method = "lm") + #regression line
labs(title="Scatterplot: deep and strategic learning")
sp11 <- ggplot(std14, aes(x = deep, y = surf)) +
geom_point() + #scatterplot
geom_smooth(method = "lm") + #regression line
labs(title="Scatterplot: deep and surface learning")
sp12 <- ggplot(std14, aes(x = deep, y = points)) +
geom_point() + #scatterplot
geom_smooth(method = "lm") + #regression line
labs(title="Scatterplot: deep learning and points")
multiplot(sp10, sp11, sp12, cols=2)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
#deep and strategic learning, r=0.097
#deep and surface learning, r= -0.324***
#deep learning and points, r= -0.010
Only the relationship between deep and surface learning (bottom left) had a significant correlation (r= -0.324). The correlation was negative, meaning that higher deep learning scores were associated with lower surface learning scores and vice versa.
This could mean that people who often use deep learning techniques rarely use surface learning techniques and vice versa.
The other relationships showed barely any correlation, and therefore the lines were fairly flat.
Strategic learning
# stra
sp13 <- ggplot(std14, aes(x = stra, y = surf)) +
geom_point() + #scatterplot
geom_smooth(method = "lm") + #regression line
labs(title="Scatterplot: strategic and surface learning")
sp14 <- ggplot(std14, aes(x = stra, y = points)) +
geom_point() + #scatterplot
geom_smooth(method = "lm") + #regression line
labs(title="Scatterplot: strategic learning and points")
sp15 <- ggplot(std14, aes(x = surf, y = points)) +
geom_point() + #scatterplot
geom_smooth(method = "lm") + #regression line
labs(title="Scatterplot: surface learning and points")
multiplot(sp13, sp14, sp15, cols=2)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
#strategic and surface learning, r= -0.161*
#strategic learning and points, r=0.146
#surface learning and points, r= -0.144
Only the association between strategic and surface learning showed a significant correlation, which was negative (r=-0.161), meaning that higher strategic learning scores were associated with lower surface learning scores.
The correlation between surface learning and points was also negative (r= -0.144), but it was not significant. If the association is real, it could mean that people who only use surface learning techniques struggle to gain a deeper understanding of different concepts, which could lead to lower exam points.
Below is the code and a figure with all the scatterplots using multiplot(), but yet again the graphs are too small, so interpreting the findings is difficult.
multiplot(sp6, sp7, sp8, sp9, sp10, sp11, sp12, sp13, sp14, sp15, cols=4) #prints 4 columns
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
TASK INSTRUCTIONS: Choose three variables as explanatory variables and fit a regression model where exam points is the target (dependent, outcome) variable. Show a summary of the fitted model and comment and interpret the results. Explain and interpret the statistical test related to the model parameters. If an explanatory variable in your model does not have a statistically significant relationship with the target variable, remove the variable from the model and fit the model again without it. (0-4 points)
Using a summary of your fitted model, explain the relationship between the chosen explanatory variables and the target variable (interpret the model parameters). Explain and interpret the multiple R-squared of the model. (0-3 points)
I chose these three independent variables since they represent different strategies for how people learn.
my_model3 <- lm(points ~ deep + stra + surf, data = std14)
# my_model3 #call the linear model, intercept and slopes.
summary(my_model3) #summary of the model, including the single variable statistical significant summaries
##
## Call:
## lm(formula = points ~ deep + stra + surf, data = std14)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.1235 -3.0737 0.5226 4.2799 10.3229
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26.9143 5.1169 5.260 4.5e-07 ***
## deep -0.7443 0.8662 -0.859 0.3915
## stra 0.9878 0.5962 1.657 0.0994 .
## surf -1.6296 0.9153 -1.780 0.0769 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.827 on 162 degrees of freedom
## Multiple R-squared: 0.04071, Adjusted R-squared: 0.02295
## F-statistic: 2.292 on 3 and 162 DF, p-value: 0.08016
Based on the model, strategic and surface learning are only marginally significant (p = 0.099 and p = 0.077, i.e. below the 0.10 level marked with "."), and deep learning is not significant (p = 0.39).
However, the model explains only about 4% of the variation in exam points (Multiple R-squared = 0.04071), or 2.3% after adjusting for the number of explanatory variables (Adjusted R-squared = 0.02295). Also, the p-value of the overall F-test is 0.08016, so the model as a whole is not significant at the 0.05 level.
Overall, it seems that the different learning techniques play little or no role in explaining exam points.
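To show where the adjusted R-squared and the overall F-test in the summary come from, the sketch below recomputes them from the multiple R-squared alone. The three input values are copied from the summary(my_model3) output above; the formulas are the standard ones for a linear model with an intercept.

```r
# Values taken from the summary(my_model3) output above
r2 <- 0.04071  # multiple R-squared
n  <- 166      # number of observations
p  <- 3        # number of explanatory variables

# Adjusted R-squared penalises the number of explanatory variables
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Overall F-statistic and its p-value (H0: all slopes are zero)
f_stat  <- (r2 / p) / ((1 - r2) / (n - p - 1))
p_value <- pf(f_stat, df1 = p, df2 = n - p - 1, lower.tail = FALSE)

round(c(adj_r2 = adj_r2, f_stat = f_stat, p_value = p_value), 4)
```

The results should agree with the printed summary up to rounding of the inputs.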
Since deep learning is not statistically significant, I will remove it from the model and fit the model again without it.
NOTE. Overall, p<.10 is not a very strong result; normally p<.05 is the threshold for statistical significance, at least in my research field (psychology).
my_model2 <- lm(points ~ stra + surf, data = std14) #exclude deep learning from the model
summary(my_model2) #summary of the model
##
## Call:
## lm(formula = points ~ stra + surf, data = std14)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.4574 -3.2820 0.4296 4.0737 9.8147
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.5635 3.3104 7.118 3.31e-11 ***
## stra 0.9635 0.5950 1.619 0.107
## surf -1.3828 0.8684 -1.592 0.113
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.822 on 163 degrees of freedom
## Multiple R-squared: 0.03634, Adjusted R-squared: 0.02452
## F-statistic: 3.074 on 2 and 163 DF, p-value: 0.04895
When excluding deep learning from the model, neither strategic nor surface learning remains significant (p = 0.107 and p = 0.113).
As an additional task, I wanted to try a completely new model, where attitude, deep and surface learning are used to explain the exam points. I chose these since they had the highest correlations. However, this might cause multicollinearity, which can affect the results.
my_model32 <- lm(points ~ attitude + deep + surf, data = std14)
summary(my_model32) #summary of the model, including the single variable statistical significant summaries
##
## Call:
## lm(formula = points ~ attitude + deep + surf, data = std14)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9168 -3.1487 0.3667 3.8326 11.3519
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.3551 4.7124 3.895 0.000143 ***
## attitude 3.4661 0.5766 6.011 1.18e-08 ***
## deep -0.9485 0.7903 -1.200 0.231815
## surf -1.0911 0.8360 -1.305 0.193669
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.313 on 162 degrees of freedom
## Multiple R-squared: 0.2024, Adjusted R-squared: 0.1876
## F-statistic: 13.7 on 3 and 162 DF, p-value: 5.217e-08
Interestingly, neither deep nor surface learning was a significant predictor of exam points. However, attitude was highly significant (p<.0001), which is in line with the clear positive correlation between attitude and exam points (r=0.437).
Overall, it seems that attitude plays a much bigger role in explaining exam results than learning techniques.
The new model is also much better than the previous one (my_model3): R-squared was 0.2024, meaning that this model explains about 20% of the variation in exam points. The model's p-value was also much better than before: p<.0001.
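The statistical test behind each row of the coefficient table is a t-test of the null hypothesis that the true coefficient is zero. As a sketch, the attitude row of summary(my_model32) can be reproduced from its estimate and standard error, both copied from the output above; the 95% confidence interval is an extra that the summary does not print.

```r
# Values taken from the attitude row of summary(my_model32) above
estimate  <- 3.4661
std_error <- 0.5766
df_resid  <- 162  # residual degrees of freedom

# t value = estimate / standard error
t_value <- estimate / std_error

# Two-sided p-value for H0: the true coefficient is zero
p_value <- 2 * pt(abs(t_value), df = df_resid, lower.tail = FALSE)

# 95% confidence interval for the slope
ci <- estimate + c(-1, 1) * qt(0.975, df = df_resid) * std_error

round(t_value, 3)
signif(p_value, 3)
round(ci, 2)
```

In words: holding deep and surface learning constant, a one-point increase in attitude is associated with roughly 3.5 more exam points, and the interval does not include zero.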
Lastly, I excluded both learning techniques from the model to see if we could increase the model fit.
my_model1 <- lm(points ~ attitude, data = std14)
summary(my_model1) #summary of the model, including the single variable statistical significant summaries
##
## Call:
## lm(formula = points ~ attitude, data = std14)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
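This had little impact: R-squared decreased only slightly, from 0.2024 to 0.1906, and adjusted R-squared from 0.1876 to 0.1856.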
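One way to check formally whether dropping predictors changes the fit is an F-test between the nested models. The sketch below is only an illustration: it uses simulated data (the data frame `sim` and all its values are made up, not the real `std14` data), and `full` and `reduced` are hypothetical stand-ins for `my_model32` and `my_model1`.

```r
# Simulated stand-in data (hypothetical values, NOT the real std14 data)
set.seed(1)
n <- 166
sim <- data.frame(
  attitude = rnorm(n, mean = 3.1, sd = 0.7),
  deep     = rnorm(n, mean = 3.7, sd = 0.55),
  surf     = rnorm(n, mean = 2.8, sd = 0.53)
)
# In this simulation only attitude truly affects points
sim$points <- 12 + 3.5 * sim$attitude + rnorm(n, sd = 5.3)

full    <- lm(points ~ attitude + deep + surf, data = sim)
reduced <- lm(points ~ attitude, data = sim)

# F-test for the two dropped terms (deep and surf)
cmp <- anova(reduced, full)
print(cmp)

# Adjusted R-squared penalises extra predictors, so it is the fairer comparison
fit_stats <- c(
  full    = summary(full)$adj.r.squared,
  reduced = summary(reduced)$adj.r.squared
)
print(fit_stats)
```

With the real data, the same two calls could be applied to `my_model1` and `my_model32`. Plain R-squared can never increase when predictors are removed, which is why adjusted R-squared and the F-test are more informative here.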
TASK INSTRUCTIONS: Produce the following diagnostic plots:
Explain the assumptions of the model and interpret the validity of those assumptions based on the diagnostic plots.
FURTHER INSTRUCTIONS based on Exercise 2: R makes it
easy to graphically explore the validity of your model assumptions by
using the plot() command, e.g., plot(my_model3).
In the plot() function, the argument which lets you
choose which plots you want. We will focus on plots 1,
2 and 5:
| which | graphic |
|---|---|
| 1 | Residuals vs Fitted values |
| 2 | Normal QQ-plot |
| 3 | Standardized residuals vs Fitted values |
| 4 | Cook’s distances |
| 5 | Residuals vs Leverage |
| 6 | Cook’s distance vs Leverage |
Before the call to the plot() function, add the
following: par(mfrow = c(2,2)). This will
place the following 4 graphics to the same plot.
RESIDUALS
In general, the following graphs focus on exploring the residuals
of the model.
From a statistical point of view, a residual is the difference between the observed value of y (the dependent variable) and the value of y predicted by the model. So: residual = actual y value − predicted y value, or $r_i = y_i - \hat{y}_i$.
Another way to think about a residual is as the vertical distance from the regression line. If an observation lies above the line, its residual is positive; if it lies below the line, its residual is negative.
If our model explained 100% of the variation in the dependent variable, every residual would be 0, meaning that all the observations would lie exactly on the regression line.
In a way, you could say that a residual measures how well the regression line fits an individual data point.
The picture above is a screenshot from khanacademy.org
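The definition of a residual can be checked directly in R. This is a minimal sketch on simulated data (the variables `x` and `y` are made up for illustration), comparing a manual calculation with what `residuals()` returns.

```r
# Simulated example data (hypothetical, for illustration only)
set.seed(2)
x <- runif(50, min = 1, max = 5)
y <- 10 + 3 * x + rnorm(50, sd = 2)

fit <- lm(y ~ x)

# Residual = observed y minus predicted (fitted) y
r_manual <- y - fitted(fit)

# Same values as the residuals stored in the model object
same <- all.equal(unname(r_manual), unname(residuals(fit)))
print(same)

# Positive residuals lie above the regression line, negative ones below it
print(table(sign(r_manual)))

# With an intercept in the model, the residuals sum to (numerically) zero
print(sum(residuals(fit)))
```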
R-squared, on the other hand, can be calculated as the correlation coefficient squared (in a simple regression with one predictor, this is the correlation between x and y). Equivalently, it is 1 minus the sum of squares of residuals (also called the residual sum of squares, SSres) divided by the total sum of squares (proportional to the variance of the data, SStot): $R^2 = 1 - SS_{res}/SS_{tot}$.
One way to illustrate SSres is to draw a square between each data point and the regression line, using the residual as the side of the square. The sum of the areas of these squares is SSres.
The picture above is a screenshot from Wikipedia.
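The two routes to R-squared can be compared numerically. The sketch below uses simulated data (all values are made up) and a single predictor, which is the case where the squared correlation between x and y equals R-squared.

```r
# Simulated example data (hypothetical, for illustration only)
set.seed(3)
x <- runif(60, min = 1, max = 5)
y <- 12 + 3.5 * x + rnorm(60, sd = 5)

fit <- lm(y ~ x)

ss_res <- sum(residuals(fit)^2)   # residual sum of squares
ss_tot <- sum((y - mean(y))^2)    # total sum of squares

r2_manual <- 1 - ss_res / ss_tot      # R^2 = 1 - SSres / SStot
r2_cor    <- cor(x, y)^2              # squared correlation (one predictor only)
r2_lm     <- summary(fit)$r.squared   # value reported by summary()

print(c(manual = r2_manual, cor_squared = r2_cor, summary = r2_lm))
```

With several predictors, `cor(x, y)^2` no longer applies directly; the squared correlation between the observed and fitted values plays the same role.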
Lastly, if the model explains only 5% of the variance of the chosen dependent variable (outcome, y), the remaining 95% is left in the residuals, that is, it is due to everything other than the chosen independent variables (x). In other words, we were not able to capture the whole phenomenon.
For this final task, I will use the first model (model3) as an example.
In that model, deep learning is not a significant predictor, whereas the strategic and surface learning variables are, but the R^2 is very low (2%).
par(mfrow = c(2,2))
plot(my_model3, which = c(1,2,5))
Residuals vs Fitted
This graph shows whether the residuals have non-linear patterns. If the
red line is roughly horizontal, we can assume that a linear model
describes the relationship adequately. If it is not, we
would need to consider other approaches, such as non-parametric methods, to
investigate the relationship between the variables. Sometimes there
is a relationship between the variables even though it is not
linear. This plot helps us to detect such non-linear
relationships (parabolic/quadratic/polynomial, exponential, stepwise,
etc.).
Based on the graph, in model 3 the line seems to be fairly
horizontal, so we can say that the residuals show no clear non-linear
pattern in relation to the predictors (independent
x-variables).
NOTE: These are just “raw” residuals, not
standardized ones.
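To see what a violated linearity assumption looks like, it can help to fit a straight line to data that are deliberately curved. This is a sketch on simulated data (the quadratic relationship and all values are made up): the residuals of the linear fit keep a strong pattern, which disappears once a squared term is added.

```r
# Simulated data with a deliberately quadratic relationship (hypothetical)
set.seed(4)
x <- seq(-3, 3, length.out = 100)
y <- 2 + x^2 + rnorm(100, sd = 0.5)

lin  <- lm(y ~ x)            # misspecified: straight line only
quad <- lm(y ~ x + I(x^2))   # includes the squared term

# Leftover curvature: correlation between the residuals and x^2
curv_lin  <- cor(residuals(lin),  x^2)
curv_quad <- cor(residuals(quad), x^2)
print(c(linear = curv_lin, quadratic = curv_quad))

# plot(lin, which = 1) would show a clearly U-shaped red line,
# while plot(quad, which = 1) would show a roughly horizontal one.
```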
Normal Q-Q
This graph shows whether the residuals of the regression model are
normally distributed. In a perfect world, the dots would fall
on the straight reference line, indicating that the residuals are indeed normally
distributed. Each data point is shown in this plot
as a dot.
In model 3, some observations at the lower and upper ends of the distribution deviate from the line. Overall, however, the points roughly follow the line, so we can reasonably assume that the residuals are approximately normally distributed.
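A visual check of the Q-Q plot can be complemented with a formal test. The sketch below is an addition rather than part of the course instructions: it applies the Shapiro-Wilk test (`shapiro.test()` from base R) to the residuals of two models fitted on simulated data, one with normal errors and one with strongly skewed errors.

```r
# Simulated data (hypothetical, for illustration only)
set.seed(5)
n <- 166
x <- rnorm(n)

y_norm <- 1 + 2 * x + rnorm(n)   # normally distributed errors
y_skew <- 1 + 2 * x + rexp(n)    # right-skewed (exponential) errors

fit_norm <- lm(y_norm ~ x)
fit_skew <- lm(y_skew ~ x)

# Null hypothesis of the test: the residuals come from a normal distribution
sw_norm <- shapiro.test(residuals(fit_norm))
sw_skew <- shapiro.test(residuals(fit_skew))

print(sw_norm)
print(sw_skew)   # a very small p-value signals a clear departure from normality
```

With larger samples the test flags even small, harmless deviations, so it is best read together with the Q-Q plot rather than instead of it.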
Residuals vs Leverage
This graph is mainly used to spot influential data points, that is, outliers or single data points that could have a large impact on our model. It can also be used to examine heteroskedasticity (residual variance that differs across values of the independent variables), which can often be caused by an outlier. We can also investigate non-linearity with this graph.
See more Rummerfield & Berman, 2017, page 3
In our example:
- We have some observations with standardized residuals below -2 (y-axis), e.g., observations 145, 35, and 19.
- The average leverage is 4/166 ≈ 0.024, so any data point beyond 0.0482 ≈ 0.05 (2 × 0.024) has a high leverage value. In our model 3 there are some observations past 0.5 (x-axis), which we could drop.
- However, we cannot even see the Cook’s distance contour line, and these outliers are not too far from the suggested cut-off lines.
Overall, we can conclude that our data has no (or only a few) influential data points.
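The quantities read off the Residuals vs Leverage plot can also be computed directly. This is a minimal sketch on simulated data (the data frame `sim` and its variables are made up, not the real model 3), using the same rule of thumb as above: the average leverage is the number of coefficients divided by the number of observations, and twice that value is the cut-off.

```r
# Simulated stand-in data (hypothetical values, NOT the real course data)
set.seed(6)
n <- 166
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
sim$y <- 20 + 1.5 * sim$x1 - sim$x2 + rnorm(n, sd = 5)

fit <- lm(y ~ x1 + x2 + x3, data = sim)

h <- hatvalues(fit)        # leverage of each observation
p <- length(coef(fit))     # 4 coefficients (intercept + 3 predictors)
avg_lev <- p / n           # 4 / 166, about 0.024
cutoff  <- 2 * avg_lev     # about 0.048

high_lev  <- which(h > cutoff)                 # high-leverage observations
large_res <- which(abs(rstandard(fit)) > 2)    # large standardized residuals
cook      <- cooks.distance(fit)               # influence of each observation

print(c(average = avg_lev, cutoff = cutoff))
print(high_lev)
print(large_res)
print(max(cook))
```

An observation is mainly a concern when it combines high leverage with a large residual, which is what Cook's distance summarises.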
End of Assignment 2: Tasks and Instructions
(more chapters to be added similarly as we proceed with the course!)